Abstract

Part of the Advanced Automation System (AAS) for air-traffic control
is a protocol to permit flight hand-off from one air-traffic controller
to another. The protocol must be fault-tolerant and, therefore, is subtle,
which makes it an ideal candidate for the application of formal methods.
This paper describes a formal method for deriving fault-tolerant protocols
that is based on refinement and proof outlines. The AAS hand-off
protocol was actually derived using this method; that derivation is
given.


1  Introduction

The next-generation air traffic control system for the United States is
currently being built under contract to the U.S. government by the IBM
Federal Systems Company (recently acquired by Loral Corp.). Advanced
Automation System (AAS) [1] is a large distributed system that must
function correctly, even if hardware components fail.

'This author is supported in part by the Defense Advanced Research Projects Agency
under NASA Ames grant number NAG 2-593, Contract N00140-87-C-8904 and by AFOSR
grant number F496209310242. The views, opinions, and findings contained in this report
are those of the author and should not be construed as an official Department of Defense
position, policy, or decision.

’This author is supported in part by the Office of Naval Research under contract
N00014-91-J-1219, AFOSR under proposal 93NM312, the National Science Foundation under
Grant CCR-8701103, and DARPA/NSF Grant CCR-9014363. Any opinions, findings, and
conclusions or recommendations expressed in this publication are those of the author and
do not reflect the views of these agencies.

Design  errors  in  AAS  software  are  avoided  and  eliminated  by  a  host 
of  methods.  This  paper  discusses  one  of  them — the  formal  derivation  of  a 
protocol  from  its  specification — and  how  it  was  applied  in  the  AAS  protocol 
for  transferring  authority  to  control  a  flight  from  one  air-traffic  controller  to 
another.  The  flight  hand-off  protocol  we  describe  is  the  one  actually  used 
in  the  production  AAS  system  (although  the  protocol  there  is  programmed 
in  Ada).  And,  the  derivation  we  give  is  a  description  of  how  the  protocol 
actually  was  first  obtained. 

The formal methods we use are neither particularly esoteric nor
sophisticated. The specification of the problem is simple, as is the
characterization of hardware failures that it must tolerate. Because the
hand-off protocol is short, computer-aided support was not necessary for
the derivation. Deriving more complex protocols would certainly benefit
from access to a theorem prover.

We  proceed  as  follows.  The  next  section  gives  a  specification  of  the 
problem  and  the  assumptions  being  made  about  the  system.  Section  3 
describes  the  formal  method  we  used.  Finally,  Section  4  contains  our 
derivation  of  the  hand-off  protocol. 

2  Specification  and  System  Model 

The  air-traffic  controller  in  charge  of  a  flight  at  any  time  is  determined  by 
the  location  of  the  flight  at  that  time.  However,  the  hand-off  of  the  flight 
from  one  controller  to  another  is  not  automatic:  some  controller  must  issue 
a  command  requesting  that  the  ownership  of  a  flight  be  transferred  from 
its  current  owner  to  a  new  controller.  This  message  is  sent  to  a  process  that 
is  executing  on  behalf  of  the  new  controller.  It  is  this  process  that  starts  the 
execution  of  the  hand-off  itself. 

The  hand-off  protocol  has  the  following  requirements: 

P1: No two controllers own the same flight at the same time.

P2: The interval during which no controller owns a flight is brief
(approximately one second).

P3:  A  controller  that  does  not  own  a  flight  knows  which  controller  does 
own  that  flight. 
The hand-off protocol is implemented on top of AAS system software
that implements several strong properties about message delivery and
execution time [1]. For our purposes, we simplify the system model somewhat
and mention only those properties needed by our hand-off protocol.

The system is structured as a set of processes running on a collection
of processors interconnected with redundant networks. The services
provided by AAS system software include a point-to-point FIFO
interprocess communication facility and a name service that allows for
location-independent interprocess communication. AAS also supports the
notion of a resilient process s comprising a primary process s.p and a
backup process s.b. The primary sends messages to the backup so that the
backup's state stays consistent with the primary. This allows the backup
to take over if the primary fails.

A resilient process is used to implement the services needed by an
air-traffic controller, including screen management, display of radar
information, and processing of flight information. We denote the primary
process for a controller C as C.p and its backup process as C.b. If C is the
owner of a flight f, then C.p can execute commands and send messages
that affect the status of flight f; C.b, like all backup processes in AAS, only
receives and records information from C.p in order to take over if C.p fails.

AAS  implements  a  simple  failure  model  for  processes  [3]: 

S1: Processes can fail by crashing. A crashed process simply stops
executing without otherwise taking any erroneous action.

S2:  If  a  primary  process  crashes,  then  its  backup  process  detects  this  and 
begins  executing  a  user-specified  routine. 

Property  S2  is  implemented  by  having  a  failure  detector  service.  This 
service  monitors  each  process  and,  upon  detecting  a  failure,  notifies  any 
interested  process. 

If  the  hand-off  protocol  runs  only  for  a  brief  interval  of  time,  then 
it  is  safe  to  assume  that  no  more  than  a  single  failure  will  occur  during 
execution.  So,  we  assume: 

S3: In any execution of the hand-off protocol, at most one of the
participating processes can crash.

S4: Messages in transit can be lost if the sender or receiver of the message
crashes. Otherwise, messages are reliably delivered, without corruption,
and in a timely fashion. No spurious messages are generated.
We also can assume that messages are not lost due to failure of
network components such as controllers and repeaters. This is a reasonable
assumption because the processors of AAS are interconnected with
redundant networks and it is assumed that no more than one of the networks
will fail.

In  any  long-running  system  in  which  processes  can  fail,  there  must  be  a 
mechanism  for  restarting  processes  and  reintegrating  them  into  the  system. 
We  ignore  such  issues  here  because  that  functionality  is  provided  by  AAS 
system software. Instead, we assume that at the beginning of a hand-off
from A to B, all four processes A.p, A.b, B.p, B.b are operational.

3  Fault-tolerance  and  Refinement 

A protocol is a program that runs on a collection of one or more processors.
We indicate that $S$ is executed on processor $p$ by writing:

    $\langle S \rangle$ at $p$    (1)

Execution of (1) is the same as skip if $p$ has failed and otherwise is the same
as executing $S$ as a single, indivisible action. This is exactly the behavior
one would expect when trying to execute an atomic action $S$ on a fail-stop
processor.

Sequential composition is indicated by juxtaposition.

    $\langle S_1 \rangle$ at $p_1$  $\langle S_2 \rangle$ at $p_2$    (2)

This statement is executed by first executing $\langle S_1 \rangle$ at $p_1$ and then
executing $\langle S_2 \rangle$ at $p_2$. First, notice that execution of
$\langle S_2 \rangle$ at $p_2$ cannot assume that $S_1$ has actually been performed. If $p_1$
fails before execution of $\langle S_1 \rangle$ at $p_1$ completes, then the execution of
$\langle S_1 \rangle$ at $p_1$ is equivalent to skip. Second, observe that an actual
implementation of (2) when $p_1$ and $p_2$ are different will require
some form of message exchange in order to enforce the sequencing.

Finally, parallel composition is specified by:

    cobegin $\langle S_1 \rangle$ at $p_1$ || $\langle S_2 \rangle$ at $p_2$ || ... || $\langle S_n \rangle$ at $p_n$ coend    (3)

This statement completes when each component $\langle S_i \rangle$ at $p_i$ has completed.
Since some of these components may have been assigned to processors that
fail, all that can be said when (3) completes is that a subset of the $S_i$ have
been performed. If, however, we also know the maximum number $t$ of
failures that can occur while (3) executes, then at least $n - t$ of the $S_i$ will
be performed.
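The semantics of (1) and (3) can be illustrated with a small executable model. The following is only a sketch under our own assumptions: the names `at`, `cobegin`, and `set_var`, and the dictionary `up` recording which processors are operational, are hypothetical and are not part of AAS or of the notation above.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Action = Callable[[State], None]


def at(action: Action, processor: str, up: Dict[str, bool], state: State) -> bool:
    """Model of <S> at p: run S atomically, or behave like skip if p has failed.

    Returns True when the action was actually performed.
    """
    if not up[processor]:
        return False  # a crashed fail-stop processor takes no step
    action(state)
    return True


def cobegin(components: List[Tuple[Action, str]], up: Dict[str, bool], state: State) -> int:
    """Model of statement (3): run every component, count those performed."""
    return sum(at(action, processor, up, state) for action, processor in components)


def set_var(name: str, value: int) -> Action:
    """Build an action that assigns a constant to a variable (hypothetical helper)."""
    def action(state: State) -> None:
        state[name] = value
    return action


# n = 3 components, t = 1 failed processor: at least n - t = 2 are performed.
state: State = {}
up = {"p1": True, "p2": False, "p3": True}
performed = cobegin(
    [(set_var("a", 1), "p1"), (set_var("b", 1), "p2"), (set_var("c", 1), "p3")],
    up,
    state,
)
```

In this model a failure is fixed before the statement starts; the paper's semantics also allows a processor to fail while (3) is executing, which the sketch does not attempt to capture.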
Proof Outlines

We use proof outlines to reason about execution of a protocol. A proof
outline is a program that has been annotated with assertions, each of which
is enclosed in braces. A precondition appears before each atomic action, and
a postcondition appears after each atomic action. Assertions are Boolean
formulas involving the program variables. Here is an example of a proof
outline.

    $\{x = 0 \land y = 0\}$
    X1: $x := x + 1$
    $\{x = 1 \land y = 0\}$
    X2: $y := y + 1$
    $\{x = 1 \land y = 1\}$

In this example, $x = 0 \land y = 0$, $x = 1 \land y = 0$, and $x = 1 \land y = 1$ are
assertions. Assertion $x = 0 \land y = 0$ is the precondition of X1, denoted
pre(X1), and assertion $x = 1 \land y = 0$ is the postcondition of X1, denoted
post(X1). The postcondition of X1 is also the precondition of X2.

A proof outline is valid if its assertions are an accurate characterization
of the program state as execution proceeds. More precisely, a proof outline
is valid if the proof outline invariant

$$\bigwedge_{S} \big( (at(S) \Rightarrow pre(S)) \land (after(S) \Rightarrow post(S)) \big)$$

is not invalidated by execution of the program, where $at(S)$ is a predicate
that is true when the program counter is at statement $S$, and $after(S)$ is a
predicate that is true when the program counter is just after statement $S$.

The proof outline above is valid. For example, execution starting in
a state where $x = 1 \land y = 0 \land after(X1)$ is true satisfies the proof outline
invariant and, as execution proceeds, the invariant remains true. Notice that
our definition of validity allows execution to begin anywhere, even in
the middle of the program. Changing post(X1) (and pre(X2)) to $x = 1$
destroys the validity of the above proof outline. (Start execution in state
$x = 1 \land y = 23 \land after(X1)$. The proof outline invariant will hold initially
but is invalidated by execution of X2.)
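The validity argument for the two-statement example can be checked mechanically over a bounded state space. The sketch below is illustrative only: the encoding of the program counter as 0, 1, 2 (at X1, after X1, after X2), the bound on variable values, and the names `step` and `is_valid` are our assumptions.

```python
from itertools import product

# Program counter encoding (an assumption of this sketch):
# 0 = at(X1), 1 = after(X1) = at(X2), 2 = after(X2).


def step(pc, x, y):
    """Execute the statement at pc, or return None if the program has ended."""
    if pc == 0:
        return 1, x + 1, y  # X1: x := x + 1
    if pc == 1:
        return 2, x, y + 1  # X2: y := y + 1
    return None


def is_valid(assertions, values=range(0, 30)):
    """Check that no step from an invariant-satisfying state falsifies the invariant.

    assertions[pc] is the assertion attached to control point pc.  Execution
    may start at any control point, as in the definition of validity.
    """
    for pc, x, y in product(range(3), values, values):
        if not assertions[pc](x, y):
            continue  # the proof outline invariant does not hold here
        nxt = step(pc, x, y)
        if nxt is not None and not assertions[nxt[0]](nxt[1], nxt[2]):
            return False
    return True


original = [
    lambda x, y: x == 0 and y == 0,  # pre(X1)
    lambda x, y: x == 1 and y == 0,  # post(X1) = pre(X2)
    lambda x, y: x == 1 and y == 1,  # post(X2)
]
# Weakening post(X1) to x = 1 admits the start state x = 1, y = 23.
weakened = [original[0], lambda x, y: x == 1, original[2]]

original_ok = is_valid(original)
weakened_ok = is_valid(weakened)
```

A bounded search of this kind is not a proof; it only exhibits counterexamples such as the $y = 23$ state mentioned above when they fall inside the explored range.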

A simple set of (syntactic) rules can be used to derive valid proof
outlines. The first such programming logic was proposed in [2]. The logic that
we use is a variant of that one, extended for concurrent programs [4].
Additional extensions are needed to derive a proof outline involving
statements like (1). Here is a rule for (1); it uses the predicate $up(p)$ to assert
that processor $p$ has not failed.

Action at Processor:

$$\frac{\{A\}\ S\ \{B\}}{\{A\}\ \langle S \rangle \text{ at } p\ \{up(p) \Rightarrow B\}}$$

Since execution of $\langle S \rangle$ at $p$ when $p$ has crashed is equivalent to a skip,
one might think that

    $\{A\}\ \langle S \rangle$ at $p\ \{(up(p) \Rightarrow B) \land (\lnot up(p) \Rightarrow A)\}$    (4)

should be valid if $\{A\}\ S\ \{B\}$ is. Proof outline (4), however, is not valid.
Consider an execution that starts in a state satisfying $A$ and suppose $p$ has
not crashed. According to the rule's hypothesis, execution of $S$ would
produce a state satisfying $B$. If processor $p$ then crashed, the state would
satisfy $\lnot up(p) \land B$. Unless $B$ implies $A$, the postcondition of (4) no longer
holds.

The problem with (4) is that the proof outline invariant is invalidated
by a processor failure. The predicate $up(p)$ changing value from true to false
causes the proof outline invariant to be falsified. We define a proof outline
to be fault-invariant with respect to a class of failures if the proof outline
invariant is not falsified by the occurrence of any allowable subset of those
failures.

For the hand-off protocol, we are concerned with tolerating a single
processor failure. We, therefore, are concerned with proof outlines whose
proof outline invariants are not falsified when $up(p)$ becomes false for a
single processor (provided $up(p)$ is initially true for all processors). Checking
that a proof outline is fault-invariant for this class of failures is simple:

Fault-Invariance: For each assertion $A$:

$$\Big(A \land \bigwedge_{p} up(p)\Big) \Rightarrow \bigwedge_{p'} A[up(p') := false]$$

where $L[x := e]$ stands for $L$ with every free occurrence of $x$ replaced by $e$.
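The Fault-Invariance check lends itself to a direct executable reading. The sketch below assumes that an assertion is modeled as a Python predicate over a program state and a map `up` from processor names to Booleans; the function name `fault_invariant` and the concrete choice of $A$ as $x = 0$ and $B$ as $x = 1$ are ours, for illustration only.

```python
def fault_invariant(assertion, states, processors):
    """Check (A and all processors up) => A[up(p') := false] for every p'."""
    for state in states:
        all_up = {p: True for p in processors}
        if not assertion(state, all_up):
            continue  # antecedent false, nothing to check for this state
        for crashed in processors:
            up = {**all_up, crashed: False}  # substitute up(crashed) := false
            if not assertion(state, up):
                return False
    return True


# Illustrative instance: A is x = 0 and B is x = 1, with a single processor p.
states = [{"x": 0}, {"x": 1}]


def rule_post(state, up):
    """Postcondition from the Action at Processor rule: up(p) => B."""
    return (not up["p"]) or state["x"] == 1


def outline4_post(state, up):
    """Postcondition of (4): (up(p) => B) and (not up(p) => A)."""
    return ((not up["p"]) or state["x"] == 1) and (up["p"] or state["x"] == 0)


rule_is_fault_invariant = fault_invariant(rule_post, states, ["p"])
outline4_is_fault_invariant = fault_invariant(outline4_post, states, ["p"])
```

With these choices the postcondition produced by the rule passes the check, while the postcondition of (4) fails it in the state where $S$ has run and $p$ then crashes, which matches the informal argument given for (4).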

4  Derivation  of  the  Hand-off  Protocol 

Let $CTR(f)$ be the set of controllers that own flight $f$. Property P1 can then
be restated as

P1': $|CTR(f)| \le 1$.

Desired is a protocol $Xfer(A, B)$ satisfying

    $\{A \in CTR(f) \land P1'\}$
    $Xfer(A, B)$
    $\{B \in CTR(f) \land P1'\}$

such that P1' holds throughout the execution of $Xfer(A, B)$.

A simple implementation of this protocol would be to use a single
variable $ctr(f)$ that contains the identity of the controller of flight $f$ and to
change $ctr(f)$ with an assignment statement:

    $\{A \in ctr(f) \land P1'\}$
    $ctr(f) := (ctr(f) - \{A\}) \cup \{B\}$
    $\{B \in ctr(f) \land P1'\}$

This implementation is problematic because the variable $ctr(f)$ must
reside at some site. Not only does this lead to a possible performance
problem, but it makes determining the owner of $f$ dependent on the
availability of this site. Therefore, we represent $CTR(f)$ with a Boolean variable
$C.ctr(f)$ at each site $C$, where

    $CTR(f)$: $\{C \mid C.ctr(f)\}$.

By doing so, we now require at least two separate actions in order to
implement $Xfer(A, B)$: one action that changes $A.ctr(f)$ and one action
that changes $B.ctr(f)$. Using the Action at Processor Rule, we get:

    $\{A \in CTR(f) \land P1'\}$
    X1: $\langle A.ctr(f) := false \rangle$ at $A$
    $\{(up(A) \Rightarrow ((A \notin CTR(f)) \land (CTR(f) = \emptyset))) \land P1'\}$

    $\{CTR(f) = \emptyset\}$
    X2: $\langle B.ctr(f) := true \rangle$ at $B$
    $\{(up(B) \Rightarrow (B \in CTR(f))) \land P1'\}$

Note that pre(X2) must assert that $CTR(f) = \emptyset$ holds, since otherwise
execution of X2 invalidates P1'.

The preconditions of X1 and X2 are mutually inconsistent, so these
statements cannot be executed in parallel. Moreover, X2 cannot be run first
because pre(X2), $CTR(f) = \emptyset$, does not hold in the initial state. Thus, X2
must execute after X1. Unfortunately, post(X1) does not imply pre(X2); if
$up(A)$ does not hold, then we cannot assert that $CTR(f) = \emptyset$ is true. This
should not be surprising: if A fails, then it cannot relinquish ownership.

One solution for this availability problem is to employ a resilient
process. That is, each controller C will have a primary process C.p and a backup
process C.b executed on processors that fail independently. Each process
has its own copy of $C.ctr(f)$, and these copies will be used to represent
$C.ctr(f)$ in a manner that tolerates a single processor failure:

$$C.ctr(f): \begin{cases} C.p.ctr(f) & \text{if } up(C.p) \\ C.b.ctr(f) & \text{if } \lnot up(C.p) \end{cases}$$

Since we assume that there is at most one failure during execution of
the protocol, the above definition never references the variable of a failed
process. Replacing references to processor "A" in Statement X1 with "A.p"
produces the following:

    $\{A \in CTR(f) \land P1'\}$
    X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
    $\{(up(A.p) \Rightarrow ((A \notin CTR(f)) \land (CTR(f) = \emptyset))) \land P1'\}$


This proof outline is not fault-invariant, however. If A.p were to fail
when the precondition holds, then the precondition might not continue
to hold. In particular, if $A \in CTR(f)$ holds because $A.p.ctr(f)$ is true and
$A.b.ctr(f)$ happens to be false, then when A.p fails, $A \in CTR(f)$ would not
hold. We need to assert that $A.p.ctr(f) = A.b.ctr(f)$ also holds whenever
pre(X1a) does. We express this condition using the following definition:

Pr: $(up(A.p) \land up(A.b)) \Rightarrow (A.b.ctr(f) = A.p.ctr(f))$

Note that if one of A.p or A.b has failed then $A.p.ctr(f)$ and $A.b.ctr(f)$ need
not be equal. Adding Pr to pre(X1a) gives the following proof outline,
which is fault-invariant for a single failure:

    $\{A \in CTR(f) \land P1' \land Pr\}$
    X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
    $\{(up(A.p) \Rightarrow ((A \notin CTR(f)) \land (CTR(f) = \emptyset))) \land P1'\}$
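The resilient representation of $C.ctr(f)$ and the role of Pr can be illustrated with a small model. This is a sketch under our own assumptions: the classes `Replica` and `Controller` and the helper names `owns`, `CTR`, and `pr` are hypothetical stand-ins for the definitions in the text, not AAS code.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    up: bool = True    # models up(C.p) or up(C.b)
    ctr: bool = False  # models C.p.ctr(f) or C.b.ctr(f)


@dataclass
class Controller:
    name: str
    p: Replica = field(default_factory=Replica)  # primary
    b: Replica = field(default_factory=Replica)  # backup

    def owns(self) -> bool:
        """C.ctr(f): the primary's copy if the primary is up, else the backup's."""
        return self.p.ctr if self.p.up else self.b.ctr


def CTR(controllers):
    """CTR(f) = {C | C.ctr(f)}."""
    return {c.name for c in controllers if c.owns()}


def pr(c: Controller) -> bool:
    """Pr: if both replicas are up, their copies of ctr(f) agree."""
    return not (c.p.up and c.b.up) or c.p.ctr == c.b.ctr


consistent = Controller("A", Replica(ctr=True), Replica(ctr=True))
inconsistent = Controller("A", Replica(ctr=True), Replica(ctr=False))

pr_before = (pr(consistent), pr(inconsistent))
ctr_before = (CTR([consistent]), CTR([inconsistent]))

# Crash the primary of each controller.
consistent.p.up = False
inconsistent.p.up = False
ctr_after = (CTR([consistent]), CTR([inconsistent]))
```

When Pr holds, the crash of the primary leaves $CTR(f)$ unchanged; when it does not, ownership silently disappears, which is the failure of fault-invariance described above.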


We need more than just X1a to implement X1, however. X1a does not
re-establish Pr, which must hold for subsequent ownership transfers. This
suggests that $A.b.ctr(f)$ also be updated. Another problem with X1a is
that post(X1a) still does not imply pre(X2): if $up(A.p)$ does not hold, then
$CTR(f) = \emptyset$ need not hold.

An action whose postcondition implies

    $up(A.b) \Rightarrow (\lnot A.b.ctr(f) \land (\lnot up(A.p) \Rightarrow (CTR(f) = \emptyset)))$

suffices. By our assumption, $up(A.p) \lor up(A.b)$ holds, so this postcondition
and post(X1a) will together allow us to conclude $CTR(f) = \emptyset$ holds, thereby
establishing pre(X2). Here is an action that, when executed in a state
satisfying pre(X1a), terminates with the above assertion holding:

    $\{(A \in CTR(f)) \land P1' \land Pr\}$
    X1b: $\langle A.b.ctr(f) := false \rangle$ at $A.b$
    $\{up(A.b) \Rightarrow (\lnot A.b.ctr(f) \land (\lnot up(A.p) \Rightarrow (CTR(f) = \emptyset)))\}$

One might think that since X1a and X1b have the same preconditions they
could be run in parallel, and the design of the first half of the protocol
would be complete. Unfortunately, we are not done yet.

The original protocol specification implicitly restricted permissible
ownership transitions. Written as a regular expression, the allowed sequence
of states is:

    $(CTR(f) = \{A\})^{+}\ (CTR(f) = \emptyset)^{*}\ (CTR(f) = \{B\})^{+}$    (5)

That is, first A owns the flight, then no controller owns the flight for zero
or more states, and finally B owns the flight. The proof outline above does
not tell us anything about transitions; it only tells that P1' holds throughout
(because P1' is implied by all assertions). We must strengthen the proof
outline to deduce that only correct transitions occur.

A regular expression (like the one above) can be represented by a finite
state machine that accepts all sentences described by the regular expression.
Furthermore, a finite state machine is characterized by a next-state
transition function. The following next-state transition function $\delta_{AB}$ characterizes
the finite state machine for (5):

$$\delta_{AB}: \begin{cases} \{\{A\}, \emptyset, \{B\}\} & \text{if } A \in CTR(f) \\ \{\emptyset, \{B\}\} & \text{if } CTR(f) = \emptyset \\ \{\{B\}\} & \text{if } B \in CTR(f) \end{cases}$$

The value of $\delta_{AB}$ is the set of values of $CTR(f)$ that are next allowed for the
protocol. For example, when $CTR(f) = \emptyset$ holds, $\delta_{AB}$ says that a transition to
a state in which $CTR(f) = \emptyset$ holds or to a state in which $CTR(f) = \{B\}$ holds
are the only permissible transitions. Note that since P1' holds, we know
that the three cases $A \in CTR(f)$, $CTR(f) = \emptyset$, $B \in CTR(f)$ are mutually
exclusive, so $\delta_{AB}$ always has a unique value.
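The transition function can be written down directly and used to check a sequence of ownership states. The sketch below is an illustration under our own assumptions: ownership sets are modeled as Python `frozenset` values, the names `delta` and `permissible` are hypothetical, and P1' is assumed to hold so that the three cases are mutually exclusive.

```python
EMPTY = frozenset()


def delta(ctr, a="A", b="B"):
    """Next-state function for (5): the values of CTR(f) allowed next."""
    if a in ctr:
        return {frozenset({a}), EMPTY, frozenset({b})}
    if not ctr:
        return {EMPTY, frozenset({b})}
    if b in ctr:
        return {frozenset({b})}
    raise ValueError("state outside the hand-off from a to b")


def permissible(trace, a="A", b="B"):
    """Check that every state is allowed by delta applied to its predecessor.

    Before the first state there is no previous state, so all three values
    are accepted, as in the convention adopted for the previous-state value.
    """
    allowed = {frozenset({a}), EMPTY, frozenset({b})}
    for ctr in trace:
        ctr = frozenset(ctr)
        if ctr not in allowed:
            return False
        allowed = delta(ctr, a, b)
    return True


good = permissible([{"A"}, set(), set(), {"B"}])
direct = permissible([{"A"}, {"B"}])
regained = permissible([{"A"}, set(), {"A"}])
reversed_order = permissible([{"B"}, {"A"}])
```

Note that this checks transitions only. It does not require a trace to begin with A as owner or to end with B as owner, which expression (5) does.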

We further define $\Theta\delta_{AB}$ to be the value of $\delta_{AB}$ in the previous state
during the execution of the hand-off protocol, or $\{\{A\}, \emptyset, \{B\}\}$ if there is
no previous state. Our hand-off protocol will make only permissible state
transitions provided each assertion implies that $CTR(f) \in \Theta\delta_{AB}$; that is,
provided the current owner of $f$ is one of the owners that was acceptable
as the "next owner" in the previous state of the system.

We therefore add conjunct $CTR(f) \in \Theta\delta_{AB}$ to the assertions in the proof
outline and check to see if the stronger proof outline is valid. If it is valid,
then we can move on to implementing X2, the second part of $Xfer(A, B)$;
otherwise, we will have reason to make further modifications.

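Here is the (strengthened) proof outline with X1a and X1b running in
parallel:

    $\{CTR(f) \in \Theta\delta_{AB} \land A \in CTR(f) \land P1' \land Pr\}$
    cobegin
        $\{CTR(f) \in \Theta\delta_{AB} \land A \in CTR(f) \land P1' \land Pr\}$
        X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
        $\{CTR(f) \in \Theta\delta_{AB} \land (up(A.p) \Rightarrow (A \notin CTR(f) \land (CTR(f) = \emptyset))) \land P1'\}$
    ||
        $\{CTR(f) \in \Theta\delta_{AB} \land (A \in CTR(f)) \land P1' \land Pr\}$
        X1b: $\langle A.b.ctr(f) := false \rangle$ at $A.b$
        $\{CTR(f) \in \Theta\delta_{AB} \land (up(A.b) \Rightarrow (\lnot A.b.ctr(f) \land (\lnot up(A.p) \Rightarrow (CTR(f) = \emptyset))))\}$
    coend
    $\{CTR(f) \in \Theta\delta_{AB}$
      $\land\ (up(A.p) \Rightarrow (A \notin CTR(f) \land (CTR(f) = \emptyset)))$
      $\land\ (up(A.b) \Rightarrow (\lnot A.b.ctr(f) \land (\lnot up(A.p) \Rightarrow (CTR(f) = \emptyset))))$
      $\land\ P1' \land Pr\}$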

Unfortunately, this proof outline is not fault-invariant. If A.p fails in a state
satisfying $after(X1a) \land at(X1b)$ then the following holds before the failure:

    $after(X1a) \land at(X1b) \land up(A.p) \land up(A.b) \land (CTR(f) = \emptyset)$
      $\land\ \lnot A.p.ctr(f) \land A.b.ctr(f) \land \Theta\delta_{AB} = \{\emptyset, \{B\}\}$

After the failure, we have:

    $after(X1a) \land at(X1b) \land \lnot up(A.p) \land up(A.b) \land (A \in CTR(f))$
      $\land\ \lnot A.p.ctr(f) \land A.b.ctr(f) \land \Theta\delta_{AB} = \{\emptyset, \{B\}\}$

So, $CTR(f) \in \Theta\delta_{AB}$ does not hold after the failure, and the first conjunct of
post(X1a) is invalidated. One simple solution is to preclude states where
$at(X1b) \land after(X1a)$ holds. This can be done by running the two actions in
sequence: first X1b and then X1a. The result is described by the following
proof outline:


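    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land A \in CTR(f)\}$
    X1b: $\langle A.b.ctr(f) := false \rangle$ at $A.b$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land A.p.ctr(f) \land (up(A.b) \Rightarrow \lnot A.b.ctr(f))\}$
    X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land (up(A.b) \Rightarrow \lnot A.b.ctr(f))$
      $\land\ (up(A.p) \Rightarrow \lnot A.p.ctr(f)) \land Pr\}$
    therefore, according to the definitions of $CTR(f)$ and $A.ctr(f)$,
    $\{CTR(f) \in \Theta\delta_{AB} \land A \notin CTR(f) \land P1' \land Pr\}$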

What we really want to conclude in post(X1a), however, is $CTR(f) = \emptyset$,
not just $A \notin CTR(f)$. This is easily done by strengthening the above proof
outline with the following:

POnly(A): For all controllers $C$: $C \ne A$: $C \notin CTR(f)$

POnly(A) is initially true because $A \in CTR(f) \land P1'$ holds. It is not
invalidated by any assignment, because the only variables assigned to are
those of A.p and A.b. So, POnly(A) remains true throughout the execution
of X1.
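The reason for sequencing X1b before X1a can be checked by enumerating every single-crash scenario for the two assignments at A. The sketch below is a simplified model under our own assumptions: only A's two copies of $ctr(f)$ are modeled, a crash is an event inserted at one point in the execution, and the names `owner_trace`, `regains_ownership`, and `safe` are hypothetical.

```python
def owner_trace(order, crash_at=None, crashed=None):
    """Return the value of A.ctr(f) after each event of one execution.

    order    -- replicas ("p" or "b") in the order their assignment runs
    crash_at -- index of the assignment just before which the crash occurs,
                len(order) for a crash after the last one, None for no crash
    crashed  -- which replica crashes
    """
    up = {"p": True, "b": True}
    ctr = {"p": True, "b": True}  # Pr holds initially and A owns the flight

    def owns():
        return ctr["p"] if up["p"] else ctr["b"]

    trace = [owns()]
    for i, replica in enumerate(order):
        if crash_at == i:
            up[crashed] = False
            trace.append(owns())
        if up[replica]:
            ctr[replica] = False  # the assignment is a skip at a crashed replica
        trace.append(owns())
    if crash_at == len(order):
        up[crashed] = False
        trace.append(owns())
    return trace


def regains_ownership(trace):
    """True if A goes from not owning the flight back to owning it."""
    return any(not before and after for before, after in zip(trace, trace[1:]))


def safe(order):
    """True if no single crash lets A regain ownership under this ordering."""
    return not any(
        regains_ownership(owner_trace(order, i, r))
        for i in range(len(order) + 1)
        for r in ("p", "b")
    )


backup_first = safe(("b", "p"))   # X1b then X1a, as in the proof outline
primary_first = safe(("p", "b"))  # X1a then X1b, one interleaving of the cobegin
```

Because the parallel composition admits the primary-first interleaving, the unsafe scenario found for that ordering is the same one identified above: A.p clears its copy, then crashes while A.b still holds the old value.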

The derivation of a protocol for X2 is basically the same, except that A
is replaced by B and false is replaced by true. Doing so results in the
proof outline shown in Figure 1.

    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land POnly(A) \land A \in CTR(f)\}$
    X1b: $\langle A.b.ctr(f) := false \rangle$ at $A.b$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land POnly(A) \land A.p.ctr(f) \land B \notin CTR(f)$
      $\land\ (up(A.b) \Rightarrow \lnot A.b.ctr(f))\}$
    X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land POnly(A) \land B \notin CTR(f)$
      $\land\ (up(A.b) \Rightarrow \lnot A.b.ctr(f)) \land (up(A.p) \Rightarrow \lnot A.p.ctr(f))\}$
    therefore, according to the definitions of $CTR(f)$, $A.ctr(f)$, and POnly(B)
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land POnly(B) \land CTR(f) = \emptyset\}$
    X2b: $\langle B.b.ctr(f) := true \rangle$ at $B.b$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land POnly(B) \land \lnot B.p.ctr(f)$
      $\land\ (up(B.b) \Rightarrow B.b.ctr(f))\}$
    X2a: $\langle B.p.ctr(f) := true \rangle$ at $B.p$
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land POnly(B)$
      $\land\ (up(B.b) \Rightarrow B.b.ctr(f)) \land (up(B.p) \Rightarrow B.p.ctr(f))\}$
    therefore, according to the definitions of $CTR(f)$ and $B.ctr(f)$,
    $\{CTR(f) \in \Theta\delta_{AB} \land P1' \land Pr \land POnly(B) \land B \in CTR(f)\}$

Figure 1: Hand-off Protocol for A and B

4.1  Implementing P3

Property P3 of Section 2 is satisfied by the protocol in Figure 1 as long as
there are exactly two controllers. When there are more than two controllers,
a controller must query other controllers in order to determine which owns
a flight. Doing so is inefficient, so we instead consider having each
controller C maintain a variable $C.ctrID(f)$ that names the owner of flight $f$.
As with $C.ctr(f)$, we represent the value of $C.ctrID(f)$ in a manner that
tolerates a single site failure:


$$C.ctrID(f): \begin{cases} C.p.ctrID(f) & \text{if } up(C.p) \\ C.b.ctrID(f) & \text{if } \lnot up(C.p) \end{cases} \qquad (6)$$


This variable can be used to implement the Boolean $C.ctr(f)$ by defining:

    $C.ctr(f)$: $(C.ctrID(f) = C)$

Thus, the assignment "$C.ctr(f) := true$" would be replaced by "$C.ctrID(f) := C$",
and "$C.ctr(f) := false$" would be replaced by "$C.ctrID(f) := X$" for any
value $X \ne C$.

We can rewrite P3 as the following:

P3': $(\exists C: (C.ctrID(f) = C)) \Rightarrow (\exists C: (C.ctrID(f) = C) \land (\forall C': C'.ctrID(f) = C))$.

For the protocol of Figure 1, P3' holds when at(X2b) is true because the
antecedent is false. Furthermore, if we explicitly assign $A.ctrID(f) := B$
in the assignments X1b and X1a, then P3' holds throughout the execution,
provided $C'$ ranges over the set $\{A, B\}$. For the other controllers, additional
statements are needed, shown in Figure 2.

Since Z(A, B) in Figure 2 changes the values of $C.ctrID(f)$, it should
be executed when $CTR(f) = \emptyset$ holds, because otherwise its execution may
violate P3'. Thus, Z(A, B) would have to be started no earlier than after(X1a)
and terminate by at(X2a). Unfortunately, Z(A, B) may take a significant
amount of time: even though its component statements can be executed
in parallel, the time to execute Z(A, B) will include some communication
and synchronization overhead. This extra time could make satisfying P2
hard or impossible.

Property P3' is perhaps a bit too strong. In fact, all that is really required
is that a controller be able to communicate with the process that owns a
flight. For example, $C.ctrID(f)$ could be the start of a path of controllers,
terminating with the current owner. The scheme where $C.ctrID(f)$ indicates
the current owner is equivalent to requiring that this path have a length of
1. But, longer paths are also acceptable.

Let $C \leadsto C'$ denote that $C.ctrID(f) = C'$, and let $C \overset{+}{\leadsto} C'$ denote the
transitive closure of $\leadsto$. Using this notation, P3' can be expressed as:

P3': $(\exists C: C \leadsto C) \Rightarrow (\exists C: C \leadsto C \land (\forall C': C' \leadsto C))$.

We weaken P3' as follows:

P3'': $(\exists C: C \leadsto C) \Rightarrow (\exists C: C \leadsto C \land (\forall C': C' \overset{+}{\leadsto} C))$.

P3'' is left invariant by the protocol in Figure 1. P3'' is also an invariant
of the protocol of Figure 2 provided $B \leadsto B \lor B \leadsto A$ initially holds. From
post(Z(A, B)) and post(X2a), we conclude that as long as the execution
of Z(A, B) completes before another hand-off starts, P3' will hold once
Z(A, B) and the protocol in Figure 1 have both terminated. Since P3' implies
$B \leadsto B \lor B \leadsto A$, the system is once again in a state from which a hand-off
can be performed. Hence, Z(A, B) can begin executing at any point during
the hand-off from A to B, because its precondition, $B \leadsto B \lor B \leadsto A$, holds
throughout the protocol of Figure 1. And, Z(A, B) must complete before a
subsequent hand-off has started.

    $\{true\}$
    Z(A, B): cobegin
        $\|_{C: C \notin \{A, B\}}$ $\{true\}$
            Zp: $\langle C.p.ctrID(f) := B \rangle$ at $C.p$
            $\{up(C.p) \Rightarrow (C.p.ctrID(f) = B)\}$
        $\|_{C: C \notin \{A, B\}}$ $\{true\}$
            Zb: $\langle C.b.ctrID(f) := B \rangle$ at $C.b$
            $\{up(C.b) \Rightarrow (C.b.ctrID(f) = B)\}$
    coend
    $\{(up(C.p) \Rightarrow (C.p.ctrID(f) = B)) \land (up(C.b) \Rightarrow (C.b.ctrID(f) = B))\}$
    therefore, according to the definitions of $C.ctrID(f)$
    $\{(\forall C \notin \{A, B\}: C \leadsto B)\}$

Figure 2: Hand-off Protocol for Controllers other than A and B
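The weakened property, under which each controller only needs a path of ctrID values ending at the owner, can be illustrated with a lookup that follows such a path. This is a sketch under our own assumptions: the dictionary `ctr_id` maps each controller name to the value of its ctrID for one flight, and the function `find_owner` is hypothetical and not part of the derived protocol.

```python
def find_owner(start, ctr_id):
    """Follow ctrID values from start until reaching a controller naming itself.

    Returns the owner, or None if the path loops without reaching an owner
    (for example, during the interval in which no controller owns the flight).
    """
    seen = set()
    current = start
    while ctr_id[current] != current:
        if current in seen:
            return None  # cycle: nobody on this path owns the flight
        seen.add(current)
        current = ctr_id[current]
    return current


# After a hand-off from A to B but before Z(A, B) has run: C and D still
# name A, A names B, and B names itself.
stale = {"A": "B", "B": "B", "C": "A", "D": "A"}
owner_from_c = find_owner("C", stale)

# After Z(A, B): every path has length 1, which is the stronger property.
fresh = {"A": "B", "B": "B", "C": "B", "D": "B"}
all_direct = all(fresh[c] == "B" for c in fresh)

# Between the two halves of the hand-off nobody owns the flight.
in_transit = {"A": "B", "B": "A"}
owner_in_transit = find_owner("A", in_transit)
```

The stale configuration still lets C reach the owner in two hops, which is why Z(A, B) need not be squeezed into the brief interval demanded by P2.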

4.2  Implementation  using  Messages 

So far, the protocol we have derived consists of assignment statements to
various variables that reside on separate processes. The protocol consists
of three processes, as follows:

    cobegin
        X1b: $\langle A.b.ctr(f) := false \rangle$ at $A.b$
        X1a: $\langle A.p.ctr(f) := false \rangle$ at $A.p$
        X2b: $\langle B.b.ctr(f) := true \rangle$ at $B.b$
        X2a: $\langle B.p.ctr(f) := true \rangle$ at $B.p$
    ||  Zp: $\|_{C: C \notin \{A, B\}}$: $\langle C.p.ctrID(f) := B \rangle$ at $C.p$
    ||  Zb: $\|_{C: C \notin \{A, B\}}$: $\langle C.b.ctrID(f) := B \rangle$ at $C.b$
    coend

An actual implementation would require that each assignment statement
be executed by the processor whose variable is being set. Furthermore,
the assignment statements of the first process must be sequenced.
This sequencing will be accomplished in our implementation by processor
B.p, since this processor starts the protocol. If B.p crashes, then B.b will take
over the sequencing. Because all assignments assign constants to variables,
when taking over, B.b can simply start at the beginning of the sequence; it
does not need to know how far B.p got before failing.
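The claim that B.b can restart the sequence from the beginning rests on the assignments being constant, and therefore idempotent. The sketch below checks this for every possible point at which the first sequencer might stop. It is a simplification under our own assumptions: the list `HANDOFF` and the function `run` are hypothetical, and the model covers only the sequencing logic, not message delivery or the skip behavior of a crashed target process.

```python
# The four sequenced assignments, as (variable, constant) pairs.
HANDOFF = [
    ("A.b.ctr", False),  # X1b
    ("A.p.ctr", False),  # X1a
    ("B.b.ctr", True),   # X2b
    ("B.p.ctr", True),   # X2a
]

INITIAL = {"A.b.ctr": True, "A.p.ctr": True, "B.b.ctr": False, "B.p.ctr": False}


def run(steps, ctr, stop_after=None):
    """Apply the assignments in order, optionally stopping after some prefix."""
    for i, (variable, value) in enumerate(steps):
        if stop_after is not None and i >= stop_after:
            break
        ctr[variable] = value
    return ctr


expected = run(HANDOFF, dict(INITIAL))

# The first sequencer stops after k steps; the second restarts from step 0.
replay_results = []
for k in range(len(HANDOFF) + 1):
    partial = run(HANDOFF, dict(INITIAL), stop_after=k)
    replay_results.append(run(HANDOFF, partial) == expected)

replay_is_harmless = all(replay_results)
```

Replaying is harmless only while no later hand-off has intervened, which is the concern the next paragraph addresses.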

B.b does need to know when B.p has finished executing the hand-off
protocol. Otherwise, a crash of B.p might cause B.b to re-execute the
hand-off from A to B after f has been later handed off to another controller,
in which case B.b would undo that later hand-off. Hence, B.b must be
notified of the completion of the hand-off before any subsequent hand-offs
are started. We represent the fact that a hand-off from A to B is in progress
with a variable B.bjcfr, whose value is initially $\bot$.

In  order  to  continue  the  implementation  using  messages,  some  further 
details  of  the  AAS  system  services  must  be  given. 

•  Communication  between  resilient  processes  uses  send  and  receive. 
If  some  process  sends  a  message  m  to  a  resilient  process  C,  then  m  is 
enqueued  at  C.p  if  C.p  has  not  crashed  and  enqueued  at  C.b  if  C.p  has 
crashed.  Furthermore,  send  does  not  return  control  until  the  message 
has  been  enqueued  at  the  remote  process.  The  remote  process  may 
crash  after  enqueueing  m  but  before  delivering  m,  in  which  case  m  is 
lost. 

•  The  primary  of  a  resilient  process  communicates  with  its  backup 
using  log.  Like  send,  log  does  not  return  control  until  the  message  is 
enqueued by the remote process. A log that is executed when there is
no  backup  (for  example,  when  the  backup  has  crashed  or  when  log  is 
executed  by  the  backup  itself)  does  nothing  and  immediately  returns 
control. 

•  Until  the  primary  of  a  resilient  process  crashes,  the  backup  delivers 
only  messages  sent  by  log. 

•  When primary C.p crashes, C.b takes over by first processing any
enqueued messages sent by C.p using log. It then executes the
user-defined recovery protocol. And, finally, it receives messages
sent to C.
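
The four service properties above can be summarized in a small Python
model. This is only a sketch under stated assumptions: the class
ResilientProcess, its queues, and its method names are hypothetical and
do not correspond to any actual AAS interface. The blocking behavior of
send and log is modeled by a method call that returns only after the
message has been enqueued.

```python
from collections import deque


class ResilientProcess:
    """Toy model of a primary/backup pair C.p and C.b (not the AAS API)."""

    def __init__(self):
        self.primary_up = True
        self.backup_up = True
        self.primary_queue = deque()  # messages enqueued at C.p by send
        self.backup_queue = deque()   # messages enqueued at C.b by send
        self.log_queue = deque()      # messages enqueued at C.b by log
        self.trace = []               # what C.b does, in order, on takeover

    def send(self, message):
        """Enqueue at C.p if it has not crashed, otherwise at C.b."""
        if self.primary_up:
            self.primary_queue.append(message)
        else:
            self.backup_queue.append(message)

    def log(self, message, from_primary=True):
        """Primary-to-backup message; does nothing when there is no backup."""
        if from_primary and self.primary_up and self.backup_up:
            self.log_queue.append(message)

    def crash_primary(self, recovery):
        """C.p fails: messages it enqueued but did not deliver are lost."""
        self.primary_up = False
        self.primary_queue.clear()
        while self.log_queue:                        # 1. logged messages first
            self.trace.append(("log", self.log_queue.popleft()))
        self.trace.append(("recovery", recovery()))  # 2. recovery protocol
        # 3. From here on, send() enqueues at C.b, which now receives
        #    the messages sent to C.


c = ResilientProcess()
c.send("m1")                          # enqueued at C.p, never delivered
c.log("owner(f) := B")                # enqueued at C.b
c.crash_primary(lambda: "recovered")  # C.b takes over
c.send("m2")                          # enqueued at C.b
```

In this run m1 is lost with the primary, the logged message is processed
before the recovery protocol, and m2 reaches the backup.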

We also use a variable in each process to represent the value of variables
C.p.ctrID(f) and C.b.ctrID(f). A simple approach would be to introduce
C.p.owner(f) and C.b.owner(f), such that:

   C.p.ctrID(f):  C.p.owner(f)

   C.b.ctrID(f):  C.b.owner(f).

Doing so, however, is inefficient (as well as difficult given the AAS
communication primitives). Consider X1b in the hand-off protocol. To
implement X1b, B.p would send a message to A.b instructing it to execute
A.b.owner(f) := B. Since X1b must complete before X1a starts, B.p cannot
start X1a before A.b completes its assignment. The result is two end-to-end
message delays.

A more efficient hand-off protocol can be implemented using the following
definitions of C.p.ctrID(f) and C.b.ctrID(f). Let the predicate E_C(f, X)
mean that C.b has enqueued but not yet processed a log from C.p that
requests the execution of C.b.owner(f) := X, and let V_C(f) be the value
of X in the most recent such log message. Then, we define:

   C.p.ctrID(f):  C.p.owner(f)

   C.b.ctrID(f):  C.b.owner(f)   if (∀X: ¬E_C(f, X))
                  V_C(f)         if (∃X: E_C(f, X))
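
As a hedged illustration of this definition, the Python sketch below
computes C.b.ctrID(f) from C.b.owner and the queue of unprocessed log
messages. The function names and the representation of a log message as
a (flight, X) pair are assumptions made for the example only.

```python
from collections import deque


def backup_ctr_id(owner, pending_logs, f):
    """Return C.b.ctrID(f) from C.b.owner and the unprocessed log queue.

    `pending_logs` holds (flight, X) pairs, oldest first, one per enqueued
    log that requests C.b.owner(flight) := X.
    """
    values = [x for flight, x in pending_logs if flight == f]
    if values:               # some X satisfies E_C(f, X)
        return values[-1]    # V_C(f): value in the most recent such log
    return owner[f]          # no pending log for f


def process_next_log(owner, pending_logs):
    """C.b processes its oldest enqueued log message."""
    flight, x = pending_logs.popleft()
    owner[flight] = x


owner = {"f": "A"}           # C.b.owner
pending = deque()            # logs enqueued at C.b but not yet processed

before = backup_ctr_id(owner, pending, "f")
pending.append(("f", "B"))   # log returns at C.p once this is enqueued
during = backup_ctr_id(owner, pending, "f")
process_next_log(owner, pending)
after = backup_ctr_id(owner, pending, "f")
```

The value of C.b.ctrID(f) changes as soon as the log message is
enqueued, before C.b has processed it, and it does not change again when
the pending assignment is finally applied.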


B.p can cause the execution of X1b followed by X1a simply by sending
a single message to A requesting execution of owner(f) := B. Upon delivery
of this message, A.p first executes a log so A.b learns of the message.
Since log does not return until E_A(f, B) holds, post(X1b) holds when log
returns. A.p can then establish post(X1a) by executing A.p.owner(f) := B.


cobegin

   {pre(X1b)}
   (log "xfr := (f, A)")  at B.p
   (send "owner(f) := B" to A)  at B.p
   (wait for "ack" from A)  at B.p
   {post(X1a) ∧ pre(X2b)}
   (log "owner(f) := B")  at B.p
   {post(X2b) ∧ pre(X2a)}
   (B.p.owner(f) := B)  at B.p
   {post(X2a)}
   (∀C : C ∉ {A, B} : send "owner(f) := B" to C)  at B.p
   (log "xfr := ⊥")  at B.p

|| C: C ≠ B :  (when deliver "owner(f) := X")  at C.p
   (log "owner(f) := X")  at C.p
   {(C = A) ⇒ post(X1b) ∧ (C ≠ A) ⇒ post(Zb)}
   (C.p.owner(f) := X)  at C.p
   {(C = A) ⇒ post(X1a)
      ∧ (C ≠ A) ⇒ (post(Zp) ∧ post(Zb))}
   (send "ack" to X)  at C.p

|| C :  (when deliver "x := v" from C.p do C.b.x := v)  at C.b

|| C :  (when C.p fails do
            if xfr = (f, X)
            then start hand-off of f from X to C)  at C.b

coend


Figure  3:  Complete  Hand-off  Protocol 




The  complete  hand-off  protocol  is  shown  in  Figure  3.  The  assertions  in 
the  code  refer  to  Figures  1  and  2. 
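
To make the crash behavior of Figure 3 concrete, the following Python
sketch steps through the protocol sequentially and injects a failure of
B.p after a chosen number of steps. It is a simplification under stated
assumptions, not the AAS implementation: the controllers in OTHERS and
all function names are hypothetical, message loss is not modeled, each
"send, then wait for ack" is collapsed into a direct call, the effect of
a log is applied to the backup immediately, and None stands for ⊥.

```python
OTHERS = ["C", "D"]  # hypothetical controllers other than A and B


def initial_world():
    """Every copy of owner(f) starts at A; None models the initial xfr."""
    owner = {(c, r): "A" for c in ["A", "B"] + OTHERS for r in ("p", "b")}
    return {"owner": owner, "xfr": None}


def deliver_owner(world, c, x):
    """Controller c handles "owner(f) := x": log to c.b, then assign at c.p."""
    world["owner"][(c, "b")] = x  # effect of the log at c.b
    world["owner"][(c, "p")] = x


def handoff_steps(world, src, dst, role):
    """Steps of the hand-off from src to dst, run by dst.p ('p') or dst.b ('b')."""
    def set_xfr(value):
        if role == "p":           # a log executed by the backup does nothing
            world["xfr"] = value

    def log_owner():
        if role == "p":
            world["owner"][(dst, "b")] = dst

    def assign_owner():
        world["owner"][(dst, role)] = dst

    def tell_others():
        for c in OTHERS:
            deliver_owner(world, c, dst)

    return [
        lambda: set_xfr(("f", src)),             # log "xfr := (f, A)"
        lambda: deliver_owner(world, src, dst),  # send to A, wait for "ack"
        log_owner,                               # log "owner(f) := B"
        assign_owner,                            # owner(f) := B at B
        tell_others,                             # send to every other C
        lambda: set_xfr(None),                   # log "xfr := ⊥"
    ]


def run_with_crash(crash_after):
    """B.p executes `crash_after` steps, then fails if any step remains."""
    world = initial_world()
    primary_steps = handoff_steps(world, "A", "B", "p")
    for step in primary_steps[:crash_after]:
        step()
    if crash_after < len(primary_steps) and world["xfr"] is not None:
        _, src = world["xfr"]                    # recovery at B.b
        for step in handoff_steps(world, src, "B", "b"):
            step()
    return world


def surviving_owners(world):
    """Values of owner(f) at every processor except B.p."""
    return {v for key, v in world["owner"].items() if key != ("B", "p")}
```

Under these assumptions, a crash before the first log leaves every copy
of owner(f) at A, because the hand-off was never recorded as started. A
crash after any later step ends with every surviving copy equal to B.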